Correction of Noisy Sentences using a Monolingual Corpus

Author

  • Diptesh Chatterjee
Abstract

Correcting noisy natural language text is an important and well-studied problem in Natural Language Processing, with applications in domains such as Statistical Machine Translation, Second Language Learning, and Natural Language Generation. In this work, we consider statistical techniques for text correction. We define the classes of errors commonly found in text and describe two algorithms to correct them. The data is taken from the output of a poorly trained Machine Translation system. Both algorithms use only a language model in the target language to correct the sentences, and both are phrase based: phrases are replaced and then combined to give the final corrected sentence. We also present methods for modeling different kinds of errors, along with the results of both algorithms on the test set. We show that one of the approaches fails to achieve the desired goal, whereas the other succeeds. Finally, we analyze the possible reasons for this difference in performance.
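The abstract does not spell out either algorithm, so the following is only a minimal sketch of the general idea it names: replace a phrase in a noisy sentence and keep the replacement when a target-language model scores the result higher. The add-one-smoothed bigram model, the toy corpus, the candidate list, and all function names (`train_bigram_lm`, `log_prob`, `correct_phrase`) are illustrative assumptions, not the paper's method.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count unigrams and bigrams over a monolingual corpus (hypothetical helper)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, lm):
    """Log-probability of a sentence under an add-one-smoothed bigram model."""
    unigrams, bigrams = lm
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


def correct_phrase(tokens, start, end, candidates, lm):
    """Replace tokens[start:end] with the candidate phrase that most improves
    the language-model score; keep the original if no candidate scores higher."""
    best, best_score = tokens, log_prob(" ".join(tokens), lm)
    for candidate in candidates:
        trial = tokens[:start] + candidate.split() + tokens[end:]
        score = log_prob(" ".join(trial), lm)
        if score > best_score:
            best, best_score = trial, score
    return best


# Toy monolingual corpus and candidate phrases (assumed data for illustration).
corpus = ["he goes to school", "she goes to school", "he goes to work"]
lm = train_bigram_lm(corpus)
noisy = "he go to school".split()
print(" ".join(correct_phrase(noisy, 1, 3, ["goes to", "going to"], lm)))
```

Raw log-probabilities favor shorter sentences, so a fuller version would likely normalize by length when candidates differ in token count; that choice is also an assumption here.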

Similar resources

Improving Translation Fluency with Search-Based Decoding and a Monolingual Statistical Machine Translation Model for Automatic Post-Editing

The BLEU scores and translation fluency for the current state-of-the-art SMT systems based on IBM models are still too low for publication purposes. The major issue is that stochastically generated sentence hypotheses, produced through a stack decoding process, may not strictly follow the natural target language grammar, since the decoding process is directed by a highly simplified translation...

An Efficient Technique for De-Noising Sentences using Monolingual Corpus and Synonym Dictionary

We describe a method of correcting noisy output of a machine translation system. Our idea is to consider different phrases of a given sentence, and find appropriate replacements of some of these from the frequently occurring similar phrases in the monolingual corpus. The frequent phrases in the monolingual corpus are indexed by a search engine. When looking for similar phrases we consider phrases ...

Collocation Extraction Using Monolingual Word Alignment Method

Statistical bilingual word alignment has been well studied in the context of machine translation. This paper adapts the bilingual word alignment algorithm to monolingual scenario to extract collocations from monolingual corpus. The monolingual corpus is first replicated to generate a parallel corpus, where each sentence pair consists of two identical sentences in the same language. Then the mon...

Automated Whole Sentence Grammar Correction Using a Noisy Channel Model

Automated grammar correction techniques have seen improvement over the years, but there is still much room for increased performance. Current correction techniques mainly focus on identifying and correcting a specific type of error, such as verb form misuse or preposition misuse, which restricts the corrections to a limited scope. We introduce a novel technique, based on a noisy channel model, ...

Method for Retrieving a Similar Sentence and Its Application to Machine Translation

In this paper, we propose incorporating similar sentence retrieval in machine translation to improve the translation of hard-to-translate input sentences. If a given input sentence is hard to translate, a sentence similar to the input sentence is retrieved from a monolingual corpus of translatable sentences and then provided to the MT system instead of the original sentence. This method is adva...

Journal:
  • CoRR

Volume: abs/1105.4318

Publication date: 2011